The possession, purchase, and sale of firearms and ammunition have been common in the United States for centuries. However, the rise in gun-related violence and deaths, especially after the COVID-19 pandemic, also deserves attention. Therefore, with the help of **FBI gun data**, we can analyze the geographic patterns, growth, and other trends of firearms in the U.S.
The gun data comes from the FBI's National Instant Criminal Background Check System, or [NICS](https://www.fbi.gov/services/cjis/nics), while the census data comes from the [U.S. Census Bureau](https://www.census.gov/).
# import required libraries for data analysis and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# load U.S. census data
census = pd.read_csv('U.S. Census Data.csv')
# load U.S. gun data, converting from Excel to CSV (note: to_csv keeps the index by default,
# which reappears as the 'Unnamed: 0' column after reloading)
pd.read_excel('gun_data.xlsx').to_csv('gun_data.csv')
gun = pd.read_csv('gun_data.csv')
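As a side note on the conversion step above, a quick round trip on made-up data shows why the reloaded CSV gains an 'Unnamed: 0' column (pandas writes the index by default), and how `index=False` would avoid it:

```python
import io
import pandas as pd

df = pd.DataFrame({'state': ['Texas', 'Ohio'], 'permit': [10, 20]})

# default to_csv writes the index, so the reloaded frame gains 'Unnamed: 0'
round_trip = pd.read_csv(io.StringIO(df.to_csv()))
print(round_trip.columns.tolist())   # ['Unnamed: 0', 'state', 'permit']

# index=False keeps the column set unchanged
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(clean.columns.tolist())        # ['state', 'permit']
```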
After data assessment, it is clear that the two datasets are hard to clean and filter, because of:

- Large scale: the gun data has more than 12,000 rows, and the census data holds around 80 facts across the states.
- Missing data: both datasets contain many NaNs, and the census data uses special letters in place of some numeric values.
- Incorrect data types: every column in the census data is of type object, while most should be ints or floats.
- Transpose needed: states appear as columns in the census data but as rows in the gun data.

**Therefore, careful investigation and selection must be carried out before the data cleaning process, in order to reduce the workload.**
The data types and formats vary widely, not just among columns but also among rows.

First of all, the missing values shall be dropped, including the supplementary notes. Based on those notes, we can locate the special letters inside the dataset and deal with them. After that, further cleaning can be carried out. The dataframe needs to be transposed, turning states into the index and facts into the columns. The 'Fact' names are quite long and need to be replaced with shorter ones. Finally, after these steps, a cleaned dataframe is created as census_df.
# take a brief look into the size and general information of data
census.head();
print(census.shape)
(85, 52)
# detect missing value amount in each column:
census.isnull().sum();
# return index of the first missing value in each column:
a = []
for i in range(52):
    a.append(census[census.iloc[:, i].isnull()].index.tolist()[0])
print(a)
# conclusion 1: 'Fact Note' has missing values from the very beginning, while all other columns start at index 65
[65, 0, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65]
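The first-missing-index loop above can also be written as one vectorized call. A sketch on a toy frame (with a caveat: `idxmax` returns the first label even for columns that contain no NaN at all, so it only applies when every column has at least one missing value):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [1.0, 2.0, np.nan]})
# isnull() gives a boolean frame; idxmax returns the first True label per column
first_nan = df.isnull().idxmax()
print(first_nan.tolist())   # [1, 2]
```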
# based on the results above, inspect the rows from index 65 onward:
census.iloc[65:, :];
# conclusion 2: starting from row 65, the supplementary contents shall be dropped since they are all NaN
# create a new dataframe with useful content for data analysis, using states as index and facts as column:
census_0 = census.iloc[:65, :].drop(['Fact Note'], axis=1).transpose()
census_0.columns = census_0.iloc[0, :]
census_0 = census_0.drop(['Fact'], axis=0).drop(['FIPS Code'], axis=1)
# find out in which columns we can find 'D', 'F', 'FN', 'NA', 'S', 'X', 'Z', as mentioned at the bottom of the original dataframe
flags = {'D', 'F', 'FN', 'NA', 'S', 'X', 'Z'}  # set membership avoids false matches such as 'AS' or 'SX'
s = census_0.shape
for v in range(s[1]):
    for i in range(s[0]):
        if isinstance(census_0.iloc[i, v], str) and census_0.iloc[i, v] in flags:
            print(census_0.iloc[i, v], census_0.iloc[:, v].name)
# conclusion 3: these rows are not closely related to the analysis or have alternatives, so they can be dropped
census_0.head();
# conclusion 4: all of the remaining data can be converted to numerical values
# conclusion 5: '%' and ',' shall be removed first to avoid errors during conversion
Z Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)
Z Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)
Z Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)
Z Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)
D Total manufacturers shipments, 2012 ($1,000)
D Total manufacturers shipments, 2012 ($1,000)
FN Total employment, percent change, 2014-2015
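The nested scan above can also be done in one vectorized pass with `DataFrame.isin`. A sketch on made-up data (the frame and its labels are illustrative, not from the census file):

```python
import pandas as pd

flags = {'D', 'F', 'FN', 'NA', 'S', 'X', 'Z'}
toy = pd.DataFrame({'popn': ['1,234', 'Z'], 'ship': ['D', '88']},
                   index=['Alabama', 'Alaska'])
mask = toy.isin(flags)            # True where a cell holds a flag letter
stacked = mask.stack()            # (state, column) MultiIndex of booleans
flagged = stacked[stacked].index.tolist()
print(flagged)                    # [('Alabama', 'ship'), ('Alaska', 'popn')]
```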
# make selections among columns, clean them and correct datatypes:
pd.options.mode.chained_assignment = None # default='warn'
census_1 = census_0.iloc[:, np.r_[0, 2, 6, 8, 10, 13, 18, 19, 26, 28, 33:36, 37, 47:50, 51, 62]]
d = census_1.shape
for v in range(d[1]):
    for i in range(d[0]):
        num = census_1.iloc[i, v]
        if ',' in num or '$' in num:
            num = float(num.strip().replace(',', '').replace('$', ''))  # change '$1,000', '1,000' or '$1000' to the float 1000.0
        elif '%' in num:
            num = float(num.strip().replace('%', '')) / 100  # change '15%' to the float 0.15
        else:
            num = float(num.strip())  # convert the remaining strings to floats
        census_1.iloc[i, v] = num
census_df = census_1.astype(float)
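The element-by-element conversion above can be collapsed into a vectorized helper with pandas string methods and `pd.to_numeric`. A sketch on made-up values (`clean_numeric` is a hypothetical helper introduced here, not part of the original notebook):

```python
import pandas as pd

def clean_numeric(s: pd.Series) -> pd.Series:
    """Convert strings like '$1,000', '15%' or '2.5' to floats."""
    s = s.astype(str).str.strip()
    pct = s.str.endswith('%')                    # remember which were percentages
    s = s.str.replace('[$,%]', '', regex=True)   # strip '$', ',' and '%'
    out = pd.to_numeric(s)
    out[pct] = out[pct] / 100                    # '15%' -> 0.15
    return out

vals = pd.Series(['$1,000', '15%', '2.5'])
print(clean_numeric(vals).tolist())   # [1000.0, 0.15, 2.5]
```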
# keep on data refining:
census_df['population% w/ age between 18 and 65'] = 1 - census_df.iloc[:, 2] - census_df.iloc[:, 3]
census_df['employment rate%'] = census_df['Total employment, 2015'] / census_df.iloc[:, 0]
census_df.drop(['Persons under 18 years, percent, July 1, 2016, (V2016)', 'Persons 65 years and over, percent, July 1, 2016, (V2016)'], axis=1, inplace=True)
census_df.drop('Total employment, 2015', axis=1, inplace=True)
census_df.columns = ['population', 'pop. change%', 'female%', 'black%', 'latino%', 'white%', 'mo. median mortgage', 'mo. median rent', 'foreign language speaking%', 'high school% for age>25', 'univ% for age>25', 'uninsured%', 'median household income', 'annual income per capita', 'poverty%', 'population/square mile', 'population% w/ age between 18 and 65', 'employment rate%']
census_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   population                            50 non-null     float64
 1   pop. change%                          50 non-null     float64
 2   female%                               50 non-null     float64
 3   black%                                50 non-null     float64
 4   latino%                               50 non-null     float64
 5   white%                                50 non-null     float64
 6   mo. median mortgage                   50 non-null     float64
 7   mo. median rent                       50 non-null     float64
 8   foreign language speaking%            50 non-null     float64
 9   high school% for age>25               50 non-null     float64
 10  univ% for age>25                      50 non-null     float64
 11  uninsured%                            50 non-null     float64
 12  median household income               50 non-null     float64
 13  annual income per capita              50 non-null     float64
 14  poverty%                              50 non-null     float64
 15  population/square mile                50 non-null     float64
 16  population% w/ age between 18 and 65  50 non-null     float64
 17  employment rate%                      50 non-null     float64
dtypes: float64(18)
memory usage: 7.4+ KB
This column has no missing values and sounds like total gun sales, which looks promising at first glance. However...

By calculation we find that 'totals' is the sum of all numeric values in a row, including new and regained gun permits, normal gun sales, and even returns and redemptions, which makes the value meaningless on its own: it cannot represent actual sales.
# take a brief look into the size and general information of data
gun.head();
print(gun.shape)
(12485, 28)
# check the number of missing values in each column, which turns out to be large
gun.isnull().sum();
# conclusion 1: no missing data for 'Unnamed: 0', 'month', 'state', 'multiple', 'totals'
# conclusion 2: only a few missing values for 'permit', 'handgun', 'long_gun', 'admin'
# test whether the column 'totals' is the sum of rest columns with numerical values, print True if yes:
a = gun.sum().shape[0]
gun.sum()[3:a-1].sum() == gun.sum()[a-1]
# conclusion 3: 'totals' = sum of all columns in a row except 'Unnamed:', 'month' and 'state'
# conclusion 4: we cannot use 'totals' to represent gun growth, since permits, redemptions and returns are included
True
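The check above compares column totals; a stricter per-row variant of the same idea, sketched on made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['Utah', 'Iowa'],
    'permit': [5, 7],
    'handgun': [3, 2],
    'long_gun': [1, 4],
})
toy['totals'] = [9, 13]
# row-wise check that 'totals' equals the sum of the numeric columns
numeric = toy[['permit', 'handgun', 'long_gun']].sum(axis=1)
ok = bool(numeric.eq(toy['totals']).all())
print(ok)   # True
```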
# test whether each row represents monthly accruals or total amount in certain month, by using 'totals':
gun['month'] = pd.to_datetime(gun['month']) # change datatype of 'month' for better plotting
gun.query('state == "Texas"').plot(x='month', y='totals', figsize=(10, 5), title='Trend of gun totals in Texas');
# conclusion 5: due to drastic fluctuation and apparent seasonality of 'totals', each row shall represent monthly accruals
# which columns shall we use for analysis?
a = gun.sum().shape[0]
gun.sum();
gun_trades = gun.sum()[a-1] - gun.sum()[2:4].sum() # 'totals' - 'permit' - 'permit_recheck', sum of gun trades only
a1 = gun.sum()[4:6].sum() / gun_trades
a2 = gun.sum()[4:9].sum() / gun_trades
a3 = a1 / a2
print('a1 = '"{:.1%}".format(a1))  # handgun & long gun purchases make up 91.0% of gun trades
print('a2 = '"{:.1%}".format(a2))  # all gun purchase columns together make up 94.2% of gun trades
print('a3 = '"{:.1%}".format(a3))  # a1/a2, see conclusion 6
# conclusion 6: for analytical convenience, handgun and long_gun purchases (96.6% of all purchases) can represent normal gun purchases in the U.S.
a1 = 91.0%
a2 = 94.2%
a3 = 96.6%
# the census data ranges from 2010 to 2016 and counts annual data only, therefore an annual gun dataset since 2010 shall be created:
gun.query('month >= "2010-01"').isnull().sum();
# surprisingly, we can see better data quality here, compared to the original data
# create gun data from 2010 to 2016, with columns 'year', 'state', 'permit_growth' and 'gun_growth'
gun_df = gun.query('"2016-12" >= month >= "2010-01"').iloc[:, 1:10] # select data from 2010 to 2016
gun_df['gun_growth'] = gun_df.iloc[:, 3:].sum(axis=1) # create the column of gun purchases
# note: monthly granularity is kept here; annual aggregation via
# groupby(pd.Grouper(key='month', freq='Y')) must be assigned to take effect, and is applied later where needed
gun_df.rename(columns = {'month':'year', 'permit':'permit_growth'}, inplace=True)
gun_df = gun_df.drop(gun_df.columns[3:9], axis=1) # drop unwanted columns
gun_df.duplicated()
print('There are', sum(gun_df.duplicated()), 'duplicated lines and', sum(gun_df.isnull().sum()), 'missing values.')
gun_df.info()
# gun data cleaning finished
There are 0 duplicated lines and 0 missing values.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4620 entries, 495 to 5114
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   year           4620 non-null   datetime64[ns]
 1   state          4620 non-null   object
 2   permit_growth  4620 non-null   float64
 3   gun_growth     4620 non-null   float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 180.5+ KB
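One detail worth noting for the grouping step: groupby returns a new object, so an aggregation only takes effect if its result is assigned back; otherwise the frame silently keeps its monthly granularity. A minimal sketch on toy monthly data:

```python
import pandas as pd

toy = pd.DataFrame({
    'month': pd.to_datetime(['2010-01-01', '2010-06-01', '2011-03-01']),
    'state': ['Ohio', 'Ohio', 'Ohio'],
    'permit': [10, 20, 30],
})
# the aggregated result must be captured in a variable
annual = toy.groupby([toy['month'].dt.year, 'state'])['permit'].sum()
print(annual.tolist())   # [30, 30] -> 2010 total and 2011 total
```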
Some questions need to be answered:

- Which characteristics are closely related to gun sales in the U.S.?
- With positive or negative correlations?
- Is there any seasonality in gun sales or registration?
- Can we find geographic patterns in the gun data?
- Are there general trends in gun sales or registration over the past decades?
Many new data frames are created to perform correlation tests between attributes.
Population Growth, Population Density and Gun Growth
# permit and gun growth 2010-2016 (gun_df only has 2010-2016 data):
test1 = gun_df.groupby('state').sum();
# population growth 2010-2016:
test1['pop_growth'] = census_df['population'] * census_df['pop. change%'];
# population density and gun growth per capita:
test1['pop_density'] = census_df['population/square mile']
test1['gun_density'] = test1['gun_growth'] / census_df['population']
test1.dropna(inplace=True)
x=test1['pop_growth']
y1=test1['gun_growth']
y2=test1['permit_growth']
x1=test1['pop_density']
y3=test1['gun_density']
fig, (ax1, ax2) = plt.subplots(1, 2)
print('correlation coefficient:')
print('------------------------')
ax1.plot(x, y1, 'o');
m, n = np.polyfit(x, y1, 1)
ax1.plot(x, m*x + n);
ax1.set_title('gun growth vs population growth')
print('population growth and gun growth:', round(np.corrcoef(x, y1)[1, 0], 2))
# conclusion 1: there is a strong positive correlation between population growth and gun growth
ax2.plot(x, y2, 'o', color='green');
m, n = np.polyfit(x, y2, 1)
ax2.plot(x, m*x + n);
ax2.set_title('permit growth vs population growth')
print('population growth and permit growth:', round(np.corrcoef(x, y2)[1, 0], 2))
fig.set_figheight(5)
fig.set_figwidth(10)
# conclusion 2: there is a weak positive correlation between population growth and permit growth
correlation coefficient:
------------------------
population growth and gun growth: 0.76
population growth and permit growth: 0.1
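The scatter-plus-fit-line pattern used throughout this section boils down to `np.corrcoef` and `np.polyfit`; on a deterministic toy series, the slope of the degree-1 fit carries the same sign as Pearson's r:

```python
import numpy as np

# made-up data with an exact positive linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
r = np.corrcoef(x, y)[1, 0]          # Pearson's r
slope, intercept = np.polyfit(x, y, 1)
print(round(r, 2), round(slope, 2))  # 1.0 2.0
```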
median = test1['pop_density'].median()
high = test1['pop_density'] >= median
low = test1['pop_density'] < median
test1['gun_density'][high].hist(alpha=0.5, bins=20, label='high population density')
test1['gun_density'][low].hist(alpha=0.5, bins=20, label='low population density')
plt.legend();
plt.title('Histogram of states with different levels of gun density', size = 13);
plt.xlabel('gun density', fontsize=10);
print('Correlation Coefficient between population density and gun density is', round(np.corrcoef(x1, y3)[1, 0], 2))
# conclusion 3: population density is negatively correlated with gun density, possibly due to rural hunting traditions
Correlation Coefficient between population density and gun density is -0.52
Gender, Race and Gun Growth
# create a new test:
if test1.index.equals(census_df.index):  # make sure both frames share the same state index
    test2 = pd.concat([test1['gun_density'], census_df.iloc[:, 2:6]], axis=1)
test2.dropna(inplace=True)
x01=test2['female%']
x02=test2['black%']
x03=test2['latino%']
x04=test2['white%']
y0=test2['gun_density']
plt.plot(x01, y0, 'o', color='pink');
plt.title('Female% and Gun Density', size = 15)
plt.xlabel('female%')
plt.ylabel('gun density')
m, n = np.polyfit(x01, y0, 1)
plt.plot(x01, m*x01 + n);
print('Correlation Coefficient between female% and gun density is', round(np.corrcoef(x01, y0)[1, 0], 2))
# conclusion 1: there is a moderate negative correlation between female% and gun density
Correlation Coefficient between female% and gun density is -0.45
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
print('correlation coefficient between race and gun density:')
print('-----------------------------------------------------')
ax1.plot(x02, y0, 'o');
ax1.set_title('black%')
ax1.set_ylabel('Gun Density')
m2, n2 = np.polyfit(x02, y0, 1)
ax1.plot(x02, m2*x02 + n2, color='purple');
print('black% - gun density:', round(np.corrcoef(x02, y0)[1, 0], 2))
ax2.plot(x03, y0, 'o');
ax2.set_title('latino%')
m3, n3 = np.polyfit(x03, y0, 1)
ax2.plot(x03, m3*x03 + n3, color='purple');
print('latino% - gun density:', round(np.corrcoef(x03, y0)[1, 0], 2))
ax3.plot(x04, y0, 'o', color='green');
ax3.set_title('white%')
m4, n4 = np.polyfit(x04, y0, 1)
ax3.plot(x04, m4*x04 + n4, color='orange');
print('white% - gun density:', round(np.corrcoef(x04, y0)[1, 0], 2))
fig.set_figheight(5)
fig.set_figwidth(10)
# conclusion 2: white% in a state has a clear positive correlation with gun density
# conclusion 3: black% and latino% both have weak negative correlations with gun density
correlation coefficient between race and gun density:
-----------------------------------------------------
black% - gun density: -0.22
latino% - gun density: -0.33
white% - gun density: 0.46
Poverty, Financial Status and Gun Growth
# create a new data set:
test3 = gun_df.query('"2015-12" >= year >= "2011-01"') # select data from 2011 to 2015
test3 = test3.groupby(['state']).sum()
test3['gun_density'] = test3['gun_growth'] / census_df['population'] # use the 2016 population as a proxy for 2015
test3.dropna(inplace=True)
if test3.index.equals(census_df.index):  # make sure both frames share the same state index
    test3 = pd.concat([test3['gun_density'], census_df.iloc[:, 6:]], axis=1) # remaining columns cover 2011-2015
x5 = test3['annual income per capita']
x6 = test3['poverty%']
x7 = 1 - test3['employment rate%'] # to calculate approx. unemployment rate
x8 = test3['uninsured%']
z=test3['gun_density']
fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4)
print('correlation coefficient between poverty and gun density:')
print('-----------------------------------------------------')
ax1.plot(x5, z, 'o', color='orange');
ax1.set_title('annual income')
ax1.set_ylabel('Gun Density')
c1, d1 = np.polyfit(x5, z, 1)
ax1.plot(x5, c1*x5 + d1);
print('annual income - gun density:', round(np.corrcoef(x5, z)[1, 0], 2))
ax2.plot(x6, z, 'o');
ax2.set_title('poverty%')
c2, d2 = np.polyfit(x6, z, 1)
ax2.plot(x6, c2*x6 + d2);
print('poverty% - gun density:', round(np.corrcoef(x6, z)[1, 0], 2))
ax3.plot(x7, z, 'o');
ax3.set_title('unemployment%')
c3, d3 = np.polyfit(x7, z, 1)
ax3.plot(x7, c3*x7 + d3);
print('unemployment% - gun density:', round(np.corrcoef(x7, z)[1, 0], 2))
ax4.plot(x8, z, 'o');
ax4.set_title('uninsured%')
c4, d4 = np.polyfit(x8, z, 1)
ax4.plot(x8, c4*x8 + d4);
print('uninsured% - gun density:', round(np.corrcoef(x8, z)[1, 0], 2))
fig.set_figheight(5)
fig.set_figwidth(13)
# conclusion 1: annual income has a negative correlation with gun density; poverty%, unemployment% and uninsured% have positive ones
correlation coefficient between poverty and gun density:
-----------------------------------------------------
annual income - gun density: -0.27
poverty% - gun density: 0.18
unemployment% - gun density: 0.21
uninsured% - gun density: 0.31
median = test3['poverty%'].median()
high = test3['poverty%'] >= median
low = test3['poverty%'] < median
test3['gun_density'][high].hist(alpha=0.5, bins=15, label='higher poverty%', color='red')
test3['gun_density'][low].hist(alpha=0.5, bins=15, label='lower poverty%')
plt.legend();
plt.title('Histogram of states with different levels of gun density', size = 13);
plt.xlabel('gun density', fontsize=10);
# conclusion 2: in general, a better financial situation is correlated with lower gun density
test3['mortgage/income'] = test3['mo. median mortgage'] * 12 / test3['median household income']
test3['rent/income'] = test3['mo. median rent'] * 12 / test3['median household income']
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
h1 = test3['mortgage/income']
h2 = test3['rent/income']
h3 = test3['gun_density']
print('correlation coefficient between housing burden and gun density:')
print('-----------------------------------------------------')
ax1.plot(h1, h3, 'o');
ax1.set_xlabel('mortgage/income')
ax1.set_ylabel('gun Density')
u1, v1 = np.polyfit(h1, h3, 1)
ax1.plot(h1, u1*h1 + v1);
print('mortgage/income - gun density:', round(np.corrcoef(h1, h3)[1, 0], 2))
ax2.plot(h2, h3, 'o');
ax2.set_xlabel('rent/income')
ax2.set_ylabel('gun Density')
u2, v2 = np.polyfit(h2, h3, 1)
ax2.plot(h2, u2*h2 + v2);
print('rent/income - gun density:', round(np.corrcoef(h2, h3)[1, 0], 2))
print('')
print('correlation coefficient between mortgage and rent cost:')
print('-----------------------------------------------------')
ax3.plot(h1, h2, 'o', color='orange');
ax3.set_xlabel('mortgage/income')
ax3.set_ylabel('rent/income')
u3, v3 = np.polyfit(h1, h2, 1)
ax3.plot(h1, u3*h1 + v3);
print('mortgage - rent:', round(np.corrcoef(h1, h2)[1, 0], 2))
fig.set_figheight(5)
fig.set_figwidth(15)
# conclusion 3: mortgage level has a strong positive correlation with rent level across U.S. states
# conclusion 4: higher housing costs correlate with lower gun density, possibly due to less need to own guns in metropolitan areas
correlation coefficient between housing burden and gun density:
-----------------------------------------------------
mortgage/income - gun density: -0.5
rent/income - gun density: -0.39

correlation coefficient between mortgage and rent cost:
-----------------------------------------------------
mortgage - rent: 0.72
Foreign Immigration and Gun Growth
fig, (ax1, ax2) = plt.subplots(1, 2)
m = census_df['foreign language speaking%']
n = census_df['population/square mile']
ax1.plot(m, n, 'o', color='orange');
ax1.set_xlabel('immigrants%')  # foreign language speaking% is used as a proxy for immigration
ax1.set_ylabel('population density')
g, h = np.polyfit(m, n, 1)
ax1.plot(m, g*m + h);
median = test3['foreign language speaking%'].median()
high = test3['foreign language speaking%'] >= median
low = test3['foreign language speaking%'] < median
test3['gun_density'][high].plot(alpha=0.5, bins=15, label='more immigrants', color='green', kind='hist');
test3['gun_density'][low].plot(alpha=0.5, bins=15, label='less immigrants', kind='hist');
plt.legend();
fig.set_figheight(5)
fig.set_figwidth(10)
a = test3['foreign language speaking%']
b = test3['gun_density']
print('Correlation between immigrants% and gun density is:', round(np.corrcoef(a, b)[1, 0], 2))
print('Correlation between immigrants% and population density is:', round(np.corrcoef(m, n)[1, 0], 2))
# conclusion 1: states with more immigrants tend to have lower gun density,
# probably because new immigrants prefer states with higher population density such as CA and NY
Correlation between immigrants% and gun density is: -0.53
Correlation between immigrants% and population density is: 0.38
Education Level and Gun Growth
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
h1 = test3['high school% for age>25']
h2 = test3['univ% for age>25']
h3 = test3['gun_density']
print('correlation coefficient between education level and gun density:')
print('-----------------------------------------------------')
ax1.plot(h1, h3, 'o');
ax1.set_xlabel('high school% for age>25')
ax1.set_ylabel('gun Density')
u1, v1 = np.polyfit(h1, h3, 1)
ax1.plot(h1, u1*h1 + v1);
print('high school% for age>25 - gun density:', round(np.corrcoef(h1, h3)[1, 0], 2))
ax2.plot(h2, h3, 'o');
ax2.set_xlabel('university% for age>25')
ax2.set_ylabel('gun Density')
u2, v2 = np.polyfit(h2, h3, 1)
ax2.plot(h2, u2*h2 + v2);
print('university% for age>25 - gun density:', round(np.corrcoef(h2, h3)[1, 0], 2))
print('')
print('correlation coefficient between high school% and university%:')
print('-----------------------------------------------------')
ax3.plot(h1, h2, 'o', color='orange');
ax3.set_xlabel('high school%')
ax3.set_ylabel('university%')
u3, v3 = np.polyfit(h1, h2, 1)
ax3.plot(h1, u3*h1 + v3);
print('high school% - university%:', round(np.corrcoef(h1, h2)[1, 0], 2))
fig.set_figheight(5)
fig.set_figwidth(15)
# conclusion 1: a high university rate among people over 25 is negatively correlated with gun density
# conclusion 2: the high school rate is also positively correlated with the university rate
# conclusion 3: however, no apparent relationship is found between the high school rate and gun density
correlation coefficient between education level and gun density:
-----------------------------------------------------
high school% for age>25 - gun density: 0.15
university% for age>25 - gun density: -0.41

correlation coefficient between high school% and university%:
-----------------------------------------------------
high school% - university%: 0.46
Based on the gun data provided, we can observe clear seasonality in gun purchases.
The main reason could be that there are usually more holidays and larger gun discounts during winter.
However, there is no clear seasonality in gun permits issued.
The effort and money spent on permits don't change much over the year, so people apply whenever they want.
gun1 = gun_df.groupby(['year']).sum()
# create a data frame of monthly U.S. gun permits issued, using year as index and month as column:
gun2 = pd.DataFrame(columns = np.r_[0:12] + 1, index = np.r_[2009:2016] + 1)
for i in range(gun1.shape[0]):
    p = gun1.index[i].year
    q = gun1.index[i].month
    gun2.iloc[p-2010, q-1] = gun1.iloc[i, 0]
# create a data frame of monthly U.S. gun purchases, using year as index and month as column:
gun3 = pd.DataFrame(columns = np.r_[0:12] + 1, index = np.r_[2009:2016] + 1)
for i in range(gun1.shape[0]):
    p = gun1.index[i].year
    q = gun1.index[i].month
    gun3.iloc[p-2010, q-1] = gun1.iloc[i, 1]
for i in range(gun2.shape[0]):
    gun2.iloc[i, :].plot(figsize=(10, 5));
plt.legend(loc='best');
plt.xticks(np.r_[0:12]+1);
plt.xlabel('month', fontsize=10);
plt.ylabel('gun permits', fontsize=10);
plt.title('Seasonality of U.S. Gun Permits Issued', fontsize=15);
# conclusion 1: there is no obvious seasonality for U.S. gun permits issued
for i in range(gun3.shape[0]):
    gun3.iloc[i, :].plot(figsize=(10, 5));
plt.legend(loc=2);
plt.xticks(np.r_[0:12]+1);
plt.xlabel('month', fontsize=10);
plt.ylabel('gun purchases', fontsize=10);
plt.title('Seasonality of U.S. Gun Purchases', fontsize=15);
# conclusion 2: U.S. gun purchases clearly peak in winter and dip in summer
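The two year-by-month frames above can also be built in a single reshape with `pivot_table` instead of manual loops. A sketch on made-up monthly data (`yr` and `mo` are helper names introduced here):

```python
import pandas as pd

toy = pd.DataFrame({
    'year': pd.to_datetime(['2010-01-01', '2010-02-01', '2011-01-01']),
    'permit_growth': [5.0, 7.0, 9.0],
})
yr = toy['year'].dt.year.rename('yr')    # calendar year for the rows
mo = toy['year'].dt.month.rename('mo')   # calendar month for the columns
season = toy.pivot_table(index=yr, columns=mo, values='permit_growth')
print(season.loc[2010, 1], season.loc[2011, 1])   # 5.0 9.0
```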
Based on the analysis above, we can draw a geographic view of gun density by state in the U.S.
From the map, we can see that states with high gun density (the darker states on the map):

1) have relatively low population densities;
2) have a quite high percentage of whites in the population;
3) have a relatively low percentage of foreign language speakers.

In short, due to cultural reasons and hunting needs, these somewhat conservative states tend to have high gun density.
# use a dictionary for U.S. states and codes:
us_state_abbrev = {'Alabama': 'AL','Alaska': 'AK','American Samoa': 'AS','Arizona': 'AZ','Arkansas': 'AR','California': 'CA','Colorado': 'CO','Connecticut': 'CT','Delaware': 'DE','District of Columbia': 'DC','Florida': 'FL','Georgia': 'GA','Guam': 'GU','Hawaii': 'HI','Idaho': 'ID','Illinois': 'IL','Indiana': 'IN','Iowa': 'IA','Kansas': 'KS','Kentucky': 'KY','Louisiana': 'LA','Maine': 'ME','Maryland': 'MD','Massachusetts': 'MA','Michigan': 'MI','Minnesota': 'MN','Mississippi': 'MS','Missouri': 'MO','Montana': 'MT','Nebraska': 'NE','Nevada': 'NV','New Hampshire': 'NH','New Jersey': 'NJ','New Mexico': 'NM','New York': 'NY','North Carolina': 'NC','North Dakota': 'ND','Northern Mariana Islands':'MP','Ohio': 'OH','Oklahoma': 'OK','Oregon': 'OR','Pennsylvania': 'PA','Puerto Rico': 'PR','Rhode Island': 'RI','South Carolina': 'SC','South Dakota': 'SD','Tennessee': 'TN','Texas': 'TX','Utah': 'UT','Vermont': 'VT','Virgin Islands': 'VI','Virginia': 'VA','Washington': 'WA','West Virginia': 'WV','Wisconsin': 'WI','Wyoming': 'WY'}
test4 = test3.loc[:, ['population/square mile', 'foreign language speaking%']]
test4['white%'] = test2['white%']
test4 = 51-test4.rank().astype(int)
test4.columns = 'rank of ' + test4.columns
test4['state'] = test3.index
test4['statecode'] = test4['state'].map(us_state_abbrev)  # map full state names to two-letter codes
test4['gun_density'] = test3['gun_density']
import plotly.express as px
fig = px.choropleth(test4,
locations = 'statecode',
locationmode = 'USA-states',
scope = 'usa',
color = 'gun_density',
hover_name = 'state',
hover_data = ['rank of white%', 'rank of population/square mile', 'rank of foreign language speaking%'],
range_color = [test4['gun_density'].min(), test4['gun_density'].max()],
color_continuous_scale = 'magenta',
title = 'Map of U.S. Gun Density')
fig.show()
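The state-code mapping step can be done with a single `Series.map` call rather than a row-by-row loop; a sketch with a toy subset of the abbreviation dictionary:

```python
import pandas as pd

abbrev = {'Texas': 'TX', 'Ohio': 'OH'}   # toy subset of us_state_abbrev
states = pd.Series(['Ohio', 'Texas'])
# map looks each value up in the dictionary in one vectorized pass
codes = states.map(abbrev)
print(codes.tolist())   # ['OH', 'TX']
```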
Texas has the biggest gun growth, while Hawaii has the smallest. For permit growth, Kentucky ranks first.
Based on annual results, both U.S. permits issued and gun purchases kept growing from 1999 to 2016.
test1 = gun_df.groupby('state').sum();
test1.sort_values('gun_growth')['gun_growth'].plot.bar(figsize=(10,5), title='Total gun purchases during 2010-2016, classified by States');
test1.sort_values('gun_growth')['gun_growth'].head(5)
# conclusion 1: Texas has the biggest gun growth during 2010-2016; Hawaii has the smallest, with only 27
state
Hawaii                    27.0
Mariana Islands           66.0
Virgin Islands          2024.0
District of Columbia    3549.0
Guam                    9558.0
Name: gun_growth, dtype: float64
test1.sort_values('permit_growth')['permit_growth'].plot.bar(figsize=(10,5), color='green', title='Total gun permits issued during 2010-2016, classified by States');
test1.sort_values('permit_growth')['permit_growth'].head(5)
# conclusion 2: Kentucky has the biggest permit growth during 2010-2016, while in several states and territories (e.g. NJ, VT, PR, Guam) the number is essentially zero
state
Puerto Rico        0.0
Mariana Islands    0.0
Vermont            0.0
Guam               0.0
New Jersey         0.0
Name: permit_growth, dtype: float64
# since the cleaned files only cover 2010-2016, we need another frame built from the original data:
test5 = gun.groupby('month').sum().loc[:, ['permit', 'handgun', 'long_gun']]
print(test5.index.min())
print(test5.index.max())
test6 = pd.DataFrame(index = np.r_[1998:2016] + 1)
for i in test6.index:
    aaa = 0  # annual totals, reset for each year
    bbb = 0
    for v in range(test5.shape[0]):
        if pd.to_datetime(test5.index[v]).year == i:
            aaa = aaa + test5.iloc[v, 0]                      # permits issued that month
            bbb = bbb + test5.iloc[v, 1] + test5.iloc[v, 2]   # handguns + long guns purchased
    test6.loc[i, 'permit_growth'] = aaa
    test6.loc[i, 'gun_growth'] = bbb
ind = np.arange(test6.shape[0])
width = 0.35
plt.figure(figsize=(10,5));
plt.bar(ind, test6['permit_growth'], width, color='gray', label='Annual Permits Issued');
plt.bar(ind+width, test6['gun_growth'], width, color='crimson', label='Annual Gun Purchases');
plt.xticks(ind+width/2, test6.index);
plt.xlabel('Year', size=12);
plt.ylabel('Amount', size=12);
plt.legend();
plt.title('Annual Gun Permits Issued and Gun Purchases in U.S.', size=15);
# conclusion 3: based on annual results, both U.S. permits issued and gun purchases kept growing
1998-11-01 00:00:00
2017-09-01 00:00:00
# calculate annualized gun purchase and permit growth rates:
rr = test6.shape[0] - 1
aa = (test6['gun_growth'].iloc[rr] / test6['gun_growth'].iloc[0])**(1/rr)
bb = (test6['permit_growth'].iloc[rr] / test6['permit_growth'].iloc[0])**(1/rr)
print('Annual gun purchase growth rate is:', round((aa-1)*100, 2), '%')
print('Annual gun permits growth rate is:', round((bb-1)*100, 2), '%')
Annual gun purchase growth rate is: 19.57 %
Annual gun permits growth rate is: 28.36 %
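The annualized rate computed above is the standard compound-growth formula, rate = (last/first)^(1/n) - 1. A tiny check with made-up numbers (`cagr` is a hypothetical helper name):

```python
def cagr(first: float, last: float, years: int) -> float:
    """Compound annual growth rate between a first and last value."""
    return (last / first) ** (1 / years) - 1

# a value that doubles over 5 years grows ~14.87% per year
print(round(cagr(100.0, 200.0, 5) * 100, 2))   # 14.87
```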
Which characters are closely related to gun sales in U.S.? With positive or negative correlations?
Here is a list of Pearson's r values as correlation coefficients:

- Population: population growth vs. gun growth 0.76; population density vs. gun density -0.52
- Gender/Race (each vs. gun density): female% -0.45; black% -0.22; latino% -0.33; white% 0.46
- Poverty (each vs. gun density): income per capita -0.27; poverty% 0.18; unemployment% 0.21; uninsured% 0.31; mortgage/income -0.5; rent/income -0.39
- Immigration: foreign language speaking% vs. gun density -0.53
- Education (each vs. gun density): high school% 0.15; university% -0.41
*In short, states with lower population density, higher poverty rates, lower housing cost relative to income, less immigration, and lower university degree rates tend to have higher gun density.*
Is there any seasonality for gun sales or registration?
There is no clear seasonality in gun permits issued in the U.S. For gun purchases, there is an obvious seasonality: people prefer to buy guns during winter. The reason is likely the bigger gun discounts around the holidays at the end of the year; for permit applications, there are no high or low seasons.

Can we find some geographic features in the gun data?
Based on the map view of our results, gun density is higher in states with a high white%, lower immigrants%, and lower population density. Those states are also relatively conservative, rely more on farming and hunting, and have both gun cultures and practical needs to own guns.

Any general trends for gun sales or registration during past decades?
For both gun permits issued and gun purchases, numbers kept going up during 1999 to 2016. The annualized rates of gun purchase growth and gun permit growth are 19.57% and 28.36%, respectively.
The biggest obstacle is that many values from the early years are missing in the original data.

First, we had to restrict most of our research to data since 2010, so for long-term trends our analysis remains rough and cannot go further. Second, we had to drop some valuable data columns, such as permit_recheck, so we could not study gun permits as thoroughly as gun purchases. In addition, it is impossible to calculate real gun purchases per year, since only handgun and long_gun are free of missing values in recent years; the gun purchases we calculated should be around 9% below the actual values, which introduces error into our research. Another difficulty is that total numbers of guns or permits are not given; we only have data on annual increases.
The third limitation is messy data formats. For example, most states use 20% for a percentage, while some use 0.2 in the same column.
This problem can be solved by data cleaning, but analysts must double-check and stay alert to such "weird" data. The last problem is that, for many census attributes, the timespans differ: some data were collected in 2016, some in 2015, some cover 2010-2016 and some 2011-2015.
This can also be solved by data filtering, but since the original data is quite messy, this step can take a lot of extra time and effort.